Tree-based Regression

Code
library(tidyverse)
library(data.table)
library(rpart)
library(rattle)
library(wooldridge)
library(reticulate)
use_python("/usr/local/bin/python3")

Regression tree

What is it?

Here is an example of a regression tree that explains logged salary (lsalary) using the mlb1 data from the wooldridge package.

Code
#=== get mlb1 data (from wooldridge) ===#
data(mlb1)

#=== build a simple tree ===#
simple_tree <-
  rpart(
    lsalary ~ hits + runsyr, 
    data = mlb1, 
    control = rpart.control(minsplit = 200)
  )

fancyRpartPlot(simple_tree)

Here is how you read the figure. At the first node, all the observations belong to it (\(n=353\)) and the estimate of lsalary is 13. The whole dataset is then split into two based on the criterion of whether hits is less than 262 or not. If yes, the observations are grouped into the node with “2” on top (the leftmost node), and the estimated lsalary for all the observations in that group (\(n = 132\)) is 12. If no, the observations are grouped into the node with “3” on top, and the estimated lsalary for all the observations in that group (\(n = 221\)) is 14. This node is further split into two groups based on whether runsyr is less than 44 or not. For observations with runsyr \(< 44\) (second node at the bottom), the estimated lsalary is 14. For those with runsyr \(>= 44\) (rightmost node at the bottom), the estimated lsalary is 15. The nodes that do not have any further bifurcations below them are called terminal nodes or leaves.

As illustrated in the figure above, a regression tree splits the data into groups based on the value of explanatory variables, and all the observations in the same group will be assigned the same estimate (the sample average of the dependent variable of the group).
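This grouping rule can be replicated by hand. Here is a minimal sketch (the cutpoints 262 and 44 are read off the figure, and the hypothetical leaf labels follow the node numbers in the plot): assign each observation to its leaf, then use the leaf-level mean of lsalary as the estimate.

```r
library(data.table)
library(wooldridge)

data(mlb1)
tree_dt <- data.table(mlb1)

#=== assign each observation to a leaf using the two splits from the figure ===#
tree_dt[, leaf := fifelse(
  hits < 262, "node 2",
  fifelse(runsyr < 44, "node 6", "node 7")
)]

#=== the tree's estimate is the leaf-level mean of the dependent variable ===#
tree_dt[, y_hat := mean(lsalary, na.rm = TRUE), by = leaf]

#=== one row per leaf: estimate and number of observations ===#
tree_dt[, .(y_hat = mean(lsalary, na.rm = TRUE), n = .N), by = leaf]
```

The estimates printed by this chunk match the (rounded) numbers shown in the leaves of the figure.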

Another way of illustrating this grouping is shown below:

Code
ggplot(mlb1) +
  geom_point(aes(y = hits, x = runsyr, color = lsalary)) +
  scale_color_viridis_c() +
  geom_hline(yintercept = 262) +
  geom_line(
    data = data.table(x = 44, y = seq(262, max(mlb1$hits), length = 100)), 
    aes(y = y, x = x)
  ) +
  annotate(
    "text", 
    x = 44, y = 111, 
    label = "Region 2", 
    color = "red"
  ) +
  annotate(
    "text", 
    x = 22, y = 1500, 
    label = "Region 6", 
    color = "red"
  ) +
  annotate(
    "text", 
    x = 75, y = 1500, 
    label = "Region 7", 
    color = "red"
  )

The mechanism called recursive binary splitting is used to split the predictor space as in the example above. Suppose you have \(K\) explanatory variables (\(X_1, \dots, X_K\)). Further, let \(c\) denote the cutpoint that splits the sample into two regions: {\(X|X_k < c\)} and {\(X|X_k \geq c\)}.

{\(X|X_k < c\)} means the observations that satisfy the condition stated to the right of the vertical bar (|). Here, it means all the observations whose \(X_k\) value is less than \(c\).

  • Step 1: For each of the explanatory variables (\(X_1\) through \(X_K\)), find the cutpoint that leads to the lowest sum of the squared residuals.
  • Step 2: Among all the splits (as many as the number of explanatory variables), pick the variable-cutpoint combination that leads to the lowest sum of the squared residuals.

The data is then split according to the chosen criterion, and the same process is repeated for each of the resulting branches until the user-specified stopping criterion is met.

Let’s try to write (an inefficient version of) this process for the first split from the beginning node, using the mlb1 data as an illustration and a simple grid search to find the optimal cutpoint (Step 1).

Code
#=== get data ===#
library(wooldridge)
data(mlb1)

mlb1_dt <- 
  mlb1 %>% 
  data.table() %>% # turn into data.table 
  .[, salary := NULL] %>% # remove salary (use lsalary instead)
  na.omit() # remove observations with NA in any of the variables

Let’s work on splitting based on hruns. First, we define a sequence of values of cutpoints for hruns.

Code
value_seq <- 
  quantile(
    mlb1_dt$hruns, 
    prob = seq(0.001, 0.999, length = 100)
  ) %>% 
  unique()

For each value in value_seq, we find the RSS. For example, for the 50th value in value_seq,

Code
copy(mlb1_dt) %>% 
  #=== find the mean of lsalary by whether hruns is less than the cutpoint or not ===#
  .[, y_hat := mean(lsalary), by = (hruns < value_seq[50])] %>% 
  #=== get squared residuals ===#
  .[, (lsalary - y_hat)^2] %>% 
  #=== get RSS ===#
  sum()
[1] 318.2431

How about the 70th value in value_seq?

Code
copy(mlb1_dt) %>% 
  #=== find the mean of lsalary by whether hruns is less than the cutpoint or not ===#
  .[, y_hat := mean(lsalary), by = (hruns < value_seq[70])] %>% 
  #=== get squared residuals ===#
  .[, (lsalary - y_hat)^2] %>% 
  #=== get RSS ===#
  sum()
[1] 419.1821

This means value_seq[50] (71.7563535) is a better cutpoint than value_seq[70] (209.0880707), since it yields a lower RSS.

Okay, let’s consider all the candidate values, not just the 50th and 70th, and then pick the best.

Code
get_rss <- function(i, var_name, value_seq, data)
{
  rss <-
    copy(data) %>% 
    setnames(var_name, "var") %>% 
    .[, y_hat := mean(lsalary), by = (var < value_seq[i])] %>% 
    .[, (lsalary - y_hat)^2] %>% 
    sum()

  return_data <-
    data.table(
      rss = rss,
      var_name = var_name,
      var_value = value_seq[i]
    )

  return(return_data)
}

Here are RSS values at every value in value_seq.

Code
rss_value <-
  lapply(
    seq_len(length(value_seq)),
    function(x) get_rss(x, "hruns", value_seq, mlb1_dt) 
  ) %>% 
  rbindlist()

head(rss_value)
        rss var_name var_value
1: 445.0615    hruns         0
2: 396.8287    hruns         1
3: 383.2393    hruns         2
4: 362.5779    hruns         3
5: 337.0760    hruns         4
6: 325.4865    hruns         5
Code
tail(rss_value)
        rss var_name var_value
1: 419.1821    hruns  209.0881
2: 419.5111    hruns  231.0000
3: 428.4163    hruns  242.4425
4: 434.3318    hruns  258.5674
5: 441.9307    hruns  333.4414
6: 443.3740    hruns  426.0780

Finding the cutpoint value that minimizes RSS,

Code
rss_value[which.min(rss), ]
        rss var_name var_value
1: 260.3094    hruns  28.47488

Okay, so the best cutpoint for hruns is 28.475.

Suppose we are considering seven explanatory variables in building a regression tree: hruns, years, rbisyr, allstar, runsyr, hits, and bavg. We do the same operation we did for hruns for each of these variables.

Code
get_rss_by_var <- function(var_name, data)
{
  temp_data <- copy(data) 

  #=== define a sequence of candidate cutpoints for the variable ===#
  value_seq <- 
    quantile(
      temp_data[, ..var_name] %>% unlist(), 
      prob = seq(0.001, 0.999, length = 100)
    ) %>% 
    unique()

  #=== get RSS ===#
  rss_value <-
    lapply(
      seq_len(length(value_seq)),
      function(x) get_rss(x, var_name, value_seq, temp_data) 
    ) %>% 
    rbindlist() %>% 
    .[which.min(rss),]

  return(rss_value)
}

Looping over the set of variables,

Code
(
min_rss_by_var <-
  lapply(
    c("hruns", "years", "rbisyr", "allstar", "runsyr", "hits", "bavg"),
    function(x) get_rss_by_var(x, mlb1_dt)
  ) %>% 
  rbindlist()
)
        rss var_name  var_value
1: 260.3094    hruns  28.474879
2: 249.9090    years   4.000000
3: 265.1082   rbisyr  32.774949
4: 279.7165  allstar   8.209356
5: 251.6343   runsyr  38.079145
6: 205.1488     hits 354.811444
7: 375.0281     bavg 252.359257

So, the variable-cutpoint combination that minimizes RSS is hits with a cutpoint of 354.81. We now have the first split. The tree is developed further by splitting nodes in the same manner.

Training a regression tree in R

You can fit a regression tree using rpart() from the rpart package. For a quick fit, its syntax is similar to that of lm().

Code
rpart(
  formula,
  data
)

Using mlb1, let’s fit a regression tree where lsalary is the dependent variable and hruns, years, rbisyr, allstar, runsyr, hits, and bavg are the explanatory variables.

Code
#=== fit a tree ===#
fitted_tree <-
  rpart(
    lsalary ~ hruns + years + rbisyr + allstar + runsyr + hits + bavg, 
    data = mlb1_dt
  )

Here is the visualization of the fitted tree using fancyRpartPlot() from the rattle package.

Code
fancyRpartPlot(fitted_tree)

Now, you may wonder why rpart() is not building a tree that has as many leaves as the number of observations so that we have a perfect prediction for the train data (mlb1). If we are simply implementing recursive binary splitting, then it should not have stopped where it stopped. This is because rpart() sets parameter values that control the development of a tree by default. Those default parameters can be seen below:

Code
rpart.control()
$minsplit
[1] 20

$minbucket
[1] 7

$cp
[1] 0.01

$maxcompete
[1] 4

$maxsurrogate
[1] 5

$usesurrogate
[1] 2

$surrogatestyle
[1] 0

$maxdepth
[1] 30

$xval
[1] 10

For example, minsplit is the minimum number of observations that must exist in a node in order for a split to be attempted. cp refers to the complexity parameter. For a given value of cp, a tree is built to minimize the following:

\[ \sum_{t=1}^T\sum_{x_i\in R_t} (y_i - \hat{y}_{R_t})^2 + cp\cdot T \]

where \(R_t\) is the \(t\)th region and \(\hat{y}_{R_t}\) is the estimate of \(y\) for all the observations that reside in \(R_t\). So, the first term is RSS. The objective function has a penalization term (the second term), just like the shrinkage methods we saw in ?@sec-shrinkage. A higher value of cp leads to a less complex tree with fewer leaves.
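To see the penalty at work, here is a quick sketch: fit the same simple tree at a few cp values and count the terminal nodes (in an rpart object, rows of fit$frame with var equal to "&lt;leaf&gt;" are the terminal nodes). The specific cp values are just illustrative.

```r
library(rpart)
library(wooldridge)

data(mlb1)

#=== a higher cp penalizes each additional leaf more, yielding a smaller tree ===#
for (cp in c(0, 0.01, 0.05)) {
  fit <-
    rpart(
      lsalary ~ hits + runsyr,
      data = mlb1,
      control = rpart.control(cp = cp)
    )

  #=== rows of fit$frame with var == "<leaf>" are terminal nodes ===#
  cat("cp =", cp, "-> number of leaves:", sum(fit$frame$var == "<leaf>"), "\n")
}
```

The number of leaves falls (weakly) as cp rises.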

If you want to build a much deeper tree that has many leaves, then you can do so using the control option like below.

Code
full_tree <-
  rpart(
    lsalary ~ hruns + years + rbisyr + allstar + runsyr + hits + bavg, # formula
    data = mlb1_dt, # data
    control = # control of the hyper parameters
      rpart.control(
        minsplit = 2, 
        cp = 0 # complexity parameter
      )
  )

Let’s see how amazing this tree is by comparing the observed and fitted lsalary values.

Code
#=== get fitted values ===#
mlb1_dt[, y_hat := predict(full_tree, newdata = mlb1_dt)]

#=== visualize the fit ===#
ggplot(data = mlb1_dt) +
  geom_point(aes(y = lsalary, x = y_hat)) +
  geom_abline(slope = 1, color = "red")

Yes, perfect prediction accuracy! At least for the train data anyway. But, we all know we want nothing to do with this kind of model. It is clearly over-fitting the train data.

In order to find a reasonable model, we can use KCV over cp. Fortunately, when we run rpart(), it automatically builds multiple trees at different values of cp (which controls the number of leaves) and conducts KCV. You can visualize the results using plotcp().

Code
plotcp(fitted_tree)

MSE and cp are presented on the y- and x-axis, respectively. According to the KCV results, cp \(= 0.018\) provides the tree with the smallest number of leaves (the simplest) whose MSE value is within one standard deviation of the lowest MSE. You can access the tree built under cp \(= 0.018\) as below.

Code
#=== get the best tree ===#
best_tree <- prune(full_tree, cp = 0.018)

#=== visualize it ===#
fancyRpartPlot(best_tree)
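The one-standard-error rule that plotcp() visualizes can also be applied programmatically, using the cptable that rpart() stores in the fitted object. A sketch (object names here are my own; the tree is refit so the chunk stands alone):

```r
library(rpart)
library(data.table)
library(wooldridge)

data(mlb1)
mlb1_dt <- na.omit(data.table(mlb1)[, salary := NULL])

#=== fit a tree (rpart() runs KCV automatically; results go into cptable) ===#
tree_for_cv <-
  rpart(
    lsalary ~ hruns + years + rbisyr + allstar + runsyr + hits + bavg,
    data = mlb1_dt
  )

#=== KCV results; columns: CP, nsplit, rel error, xerror, xstd ===#
cp_table <- data.frame(tree_for_cv$cptable)

#=== threshold: lowest CV error plus one standard error ===#
cv_threshold <- min(cp_table$xerror) + cp_table$xstd[which.min(cp_table$xerror)]

#=== simplest tree (fewest splits) whose CV error is within the threshold ===#
#=== (cptable rows are ordered from fewest to most splits) ===#
best_cp <- cp_table$CP[which(cp_table$xerror <= cv_threshold)[1]]

#=== prune with the selected cp ===#
best_tree_1se <- prune(tree_for_cv, cp = best_cp)
```

Because the CV folds are random, the selected cp can vary slightly from run to run.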

We have now seen how a regression tree is built in R. In practice, you rarely use a regression tree itself as the final model for your research, as its performance is rather poor and it tends to over-fit compared to other competitive methods. But understanding how a regression tree is built is important for understanding its derivatives, like random forest and boosted regression forest.

Random Forest (RF)

The regression tree approach is often not robust and suffers from high variance. Here, we look at the process called bagging and how it can be used to train an RF model, which is much more robust than a single regression tree.

Bagging (Bootstrap aggregation)

Consider two random variables \(x_1\) and \(x_2\) from the identical distribution, where \(E[x_i] = \alpha\) and \(Var(x_i) = \sigma^2\). You are interested in estimating \(E[x_i]\). We all know that the following relationship holds in general:

\[ \begin{aligned} Var(\frac{x_1 + x_2}{2}) & = \frac{Var(x_1)}{4} + \frac{Var(x_2)}{4} + \frac{Cov(x_1, x_2)}{2} \\ & = \frac{\sigma^2}{2} + \frac{Cov(x_1, x_2)}{2} \end{aligned} \]

So, instead of using a single draw from \(x_1\) and using it as an estimate for \(E[x_i]\), it is better to use the values from both \(x_1\) and \(x_2\) and average them to obtain an estimate for \(E[x_i]\), as long as \(x_1\) and \(x_2\) are not perfectly positively correlated (in which case \(Cov(x_1, x_2) = Var(x_1) = Var(x_2) = \sigma^2\)). The benefit of averaging is greater when the value of \(Cov(x_1, x_2)\) is smaller.

Let’s do a little experiment to see this. We consider three cases:

Code
#=== set the number of observations to 1000 ===#
N <- 1000
Code
#=== first case (no correlation) ===#
x_1 <- rnorm(N)
x_2 <- rnorm(N)

cor(x_1, x_2)
[1] -0.0006770345
Code
#=== second case (positively correlated) ===#
x_1 <- rnorm(N)
x_2 <- 0.5 * x_1 + sqrt(1-(0.5)^2) * rnorm(N)

cor(x_1, x_2)
[1] 0.4976275
Code
#=== third case (negatively correlated) ===#
x_1 <- rnorm(N)
x_2 <- - 0.8 * x_1 - sqrt(1-(0.8)^2) * rnorm(N)

cor(x_1, x_2)
[1] -0.8257903
Code
get_alpha <- function(i)
{
  #=== base case ===#
  alpha_hat_0 <- rnorm(1)

  #=== first case (no correlation) ===#
  x_1 <- rnorm(1)
  x_2 <- rnorm(1)

  alpha_hat_1 <- (x_1 + x_2) / 2

  #=== second case (positively correlated) ===#
  x_1 <- rnorm(1)
  x_2 <- 0.5 * x_1 + sqrt(1-(0.5)^2) * rnorm(1)

  alpha_hat_2 <- (x_1 + x_2) / 2

  #=== third case (negatively correlated) ===#
  x_1 <- rnorm(1)
  x_2 <- - 0.8 * x_1 - sqrt(1-(0.8)^2) * rnorm(1)

  alpha_hat_3 <- (x_1 + x_2) / 2

  return_data <-
    data.table(
      alpha_hat_0 = alpha_hat_0,
      alpha_hat_1 = alpha_hat_1,
      alpha_hat_2 = alpha_hat_2,
      alpha_hat_3 = alpha_hat_3
    )

  return(return_data)

} 
Code
set.seed(234934)

sim_results <-
  lapply(
    1:1000,
    get_alpha
  ) %>% 
  rbindlist() %>% 
  melt()
Warning in melt.data.table(.): id.vars and measure.vars are internally
guessed when both are 'NULL'. All non-numeric/integer/logical type columns are
considered id.vars, which in this case are columns []. Consider providing at
least one of 'id' or 'measure' vars in future.

As you can see below, they are all pretty much unbiased. However, all the cases that averaged two values (cases 1, 2, and 3) outperformed the base case that relied on a single value each iteration. You can see that when the random variables are negatively correlated, the power of averaging is greater compared to when they are independent or positively correlated. The independent case (case 1) is better than the positive correlation case (case 2).

Code
#=== expected value ===#
sim_results[, mean(value), by = variable]
      variable          V1
1: alpha_hat_0 0.037836191
2: alpha_hat_1 0.008031971
3: alpha_hat_2 0.008486035
4: alpha_hat_3 0.004259832
Code
#=== standard error ===#
sim_results[, sd(value), by = variable]
      variable        V1
1: alpha_hat_0 0.9840232
2: alpha_hat_1 0.6969730
3: alpha_hat_2 0.8660699
4: alpha_hat_3 0.3103408

Bagging takes advantage of the power of averaging. Specifically, bagging takes the following steps:

  1. Generate many bootstrapped datasets (say \(B\) datasets)
  2. Train a model on each of the bootstrapped datasets (\(\hat{f}^1, \dots, \hat{f}^B\))
  3. Average the estimates from all the trained models to come up with an estimate

\[ \hat{f}(X) = \frac{\hat{f}^1(X) + \dots + \hat{f}^B(X)}{B} \]

Let’s implement this for \(B = 10\) using mlb1_dt. First, define a function (named train_a_tree()) that bootstraps the data, fits a regression tree, and then returns the fitted values.

Code
train_a_tree <- function(i, data)
{
  #=== number of observations ===#
  N <- nrow(data)

  #=== bootstrapped data ===#
  boot_data <- data[sample(1:N, N, replace = TRUE), ]

  #=== train a regression tree ===#
  tree_fit <-
    rpart(
      lsalary ~ hruns + years + rbisyr + allstar + runsyr + hits + bavg, 
      data = boot_data
    )

  #=== predict ===#
  return_data <-
    copy(data) %>% 
    .[, y_hat := predict(tree_fit, newdata = data)] %>% 
    .[, .(id, y_hat)]

  return(return_data)
}

We now repeat train_a_tree() 10 times.

Code
#=== create observation id for later group-by averaging ===#
mlb1_dt[, id := 1:.N]

(
y_estimates <-
  lapply(
    1:10,
    function(x) train_a_tree(x, mlb1_dt) 
  ) %>% 
  rbindlist()
)
       id    y_hat
   1:   1 15.15792
   2:   2 14.23816
   3:   3 15.15792
   4:   4 14.23816
   5:   5 14.23816
  ---             
3296: 326 13.17676
3297: 327 12.64863
3298: 328 12.64863
3299: 329 13.61157
3300: 330 12.02302

By averaging \(y\) estimates by id, we can get bagging estimates.

Code
y_estimates[, mean(y_hat), by = id]
      id       V1
  1:   1 15.09137
  2:   2 14.66143
  3:   3 14.61425
  4:   4 14.31035
  5:   5 13.74852
 ---             
326: 326 13.04455
327: 327 12.79717
328: 328 12.79717
329: 329 13.67179
330: 330 12.02051

Now, let’s take a look at the individual estimates of \(y\) for the first observation.

Code
y_estimates[id == 1, ]
    id    y_hat
 1:  1 15.15792
 2:  1 15.04446
 3:  1 15.21238
 4:  1 15.07311
 5:  1 14.92840
 6:  1 14.98571
 7:  1 15.12060
 8:  1 15.14064
 9:  1 15.05676
10:  1 15.19375

Hmm, the estimates look very similar. Actually, that is the case for all the observations. This is because the trained trees are very similar for many reasons, and the trees are highly “positively” correlated with each other. From our very simple experiment above, we know that the power of averaging is not very high when that is the case. While RF does use bagging, popular R and Python packages do it in a much better way than I demonstrated here. We see this next.

Random Forest (RF)

Unlike the naive bagging approach I demonstrated above, RF decorrelates the trees in a clever way. Specifically, at every split of every tree, only a randomly selected subset of the explanatory variables is considered as the splitting variable. A typical choice of the number of variables considered at each split is \(\sqrt{K}\), where \(K\) is the number of explanatory variables specified by the user. In the naive example above, all \(K\) variables were considered for every split decision of every tree. Some variables are more influential than others, and they get picked as the splitting variable at similar places, which can result in highly correlated trees. By giving other variables a chance, RF helps decorrelate the trees.
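For the seven explanatory variables we have been using, the \(\sqrt{K}\) rule (rounded down) works out as follows:

```r
K <- 7          # number of explanatory variables in the formula we have been using

floor(sqrt(K))  # number of variables considered at each split
```

This matches the default of the ranger package introduced below.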

We can use ranger() from the ranger package to train an RF model.

Another compelling R package for RF is the randomForest package.

The ranger() function has many options you can specify that determine how trees are built. Here are some of the important ones (see here for the complete description of the hyper-parameters.):

  • mtry: the number of variables considered at each split (default is the square root of the total number of explanatory variables, rounded down)
  • num.trees: the number of trees to be built (default is 500)
  • min.node.size: the minimum number of observations in each node (default varies based on the type of analysis)
  • replace: whether to sample with or without replacement when bootstrapping (default is TRUE)
  • sample.fraction: the fraction of the entire observations that are used in each tree (default is 1 if sampling with replacement, 0.632 if sampling without replacement)

Let’s try fitting an RF with ranger() with the default parameters.

Code
#=== load the package ===#
library(ranger)

Attaching package: 'ranger'
The following object is masked from 'package:rattle':

    importance
Code
#=== fit and RF ===#
(
rf_fit <- 
  ranger(
    lsalary ~ hruns + years + rbisyr + allstar + runsyr + hits + bavg, 
    data = mlb1_dt
  )
)
Ranger result

Call:
 ranger(lsalary ~ hruns + years + rbisyr + allstar + runsyr +      hits + bavg, data = mlb1_dt) 

Type:                             Regression 
Number of trees:                  500 
Sample size:                      330 
Number of independent variables:  7 
Mtry:                             2 
Target node size:                 5 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       0.3668547 
R squared (OOB):                  0.7288123 

Since we have many trees, it is no longer possible to have a nice graphical representation of the trained RF model like we did with a regression tree.

In the output, you can see OOB prediction error (MSE). OOB stands for out-of-bag. When bootstrapping (whether you do it with replacement or not), some of the train data will not be used to build a tree.

Code
n_obs <- nrow(mlb1_dt)

#=== bootstrapped data ===#
boot_data <- mlb1_dt[sample(1:n_obs, n_obs, replace = TRUE), ]

#=== which rows (observations) from the original datasets are missing? ===#
mlb1_dt[, id %in% unique(boot_data$id)] %>% mean()
[1] 0.6212121

So, only about \(62\%\) of the rows from the original data (mlb1_dt) appear in this bootstrapped sample (which contains many duplicates of the original observations). The observations that are NOT included in the bootstrapped sample are called out-of-bag observations. This provides a great opportunity to estimate test MSE while training an RF model! For a given regression tree, you can apply it to its out-of-bag observations to calculate MSE. You can repeat this for all the trees and average the MSEs, effectively conducting cross-validation. When the number of trees is large enough, OOB MSE is almost equivalent to the MSE from LOOCV [@james2013introduction]. This means that we can tune hyper-parameters by comparing OOB MSEs across different configurations.
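The share observed above is no accident. With \(n\) draws with replacement, the probability that a given observation is never drawn is \((1 - 1/n)^n\), so the expected share of observations that appear in the bootstrapped sample is \(1 - (1 - 1/n)^n\), which approaches \(1 - e^{-1} \approx 0.632\) as \(n\) grows. A quick check:

```r
n <- 330  # number of observations in mlb1_dt

#=== probability a given observation is never drawn in n draws with replacement ===#
p_oob <- (1 - 1 / n)^n

#=== expected share of observations that appear in the bootstrapped sample ===#
1 - p_oob  # close to 1 - exp(-1) = 0.632
```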

You can use a simple grid search to find the best hyper-parameters. Grid search is simply a brute-force optimization method that goes through all the combinations of hyper-parameters and sees which combination comes out on top. The computational intensity of grid search depends on how many hyper-parameters you want to vary and how many values you would like to try for each of them. Here, let’s tune mtry, min.node.size, and sample.fraction.

Code
#=== define set of values you want to look at ===#
mtry_seq <- c(2, 4, 7)
min_node_size_seq <- c(2, 5, 10)
sample_fraction_seq <- c(0.5, 0.75, 1)

#=== create a complete combinations of the three parameters ===#
(
parameters <-
  data.table::CJ(
    mtry = mtry_seq,
    min_node_size = min_node_size_seq,
    sample_fraction = sample_fraction_seq
  )
)
    mtry min_node_size sample_fraction
 1:    2             2            0.50
 2:    2             2            0.75
 3:    2             2            1.00
 4:    2             5            0.50
 5:    2             5            0.75
 6:    2             5            1.00
 7:    2            10            0.50
 8:    2            10            0.75
 9:    2            10            1.00
10:    4             2            0.50
11:    4             2            0.75
12:    4             2            1.00
13:    4             5            0.50
14:    4             5            0.75
15:    4             5            1.00
16:    4            10            0.50
17:    4            10            0.75
18:    4            10            1.00
19:    7             2            0.50
20:    7             2            0.75
21:    7             2            1.00
22:    7             5            0.50
23:    7             5            0.75
24:    7             5            1.00
25:    7            10            0.50
26:    7            10            0.75
27:    7            10            1.00
    mtry min_node_size sample_fraction

In total, we have 27 (\(3 \times 3 \times 3\)) cases. You can see how quickly the number of cases increases as you increase the number of parameters to tune and the number of values tried for each parameter. We can now loop over the rows of this parameter data (parameters) and get the OOB MSE for each of them.

Code
oob_mse_all <-
  lapply(
    seq_len(nrow(parameters)),
    function(x) {

      #=== fit the model ===#
      rf_fit <- 
        ranger(
          lsalary ~ hruns + years + rbisyr + allstar + runsyr + hits + bavg, 
          data = mlb1_dt,
          num.trees = 1000,
          mtry = parameters[x, mtry],
          min.node.size = parameters[x, min_node_size],
          sample.fraction = parameters[x, sample_fraction]
        )

      #=== return OOB MSE ===#
      return(rf_fit$prediction.error)
      
    }
  ) %>% 
  unlist()

#=== assign OOB MSE to the parameters data ===#
parameters[, oob_mse := oob_mse_all]

#=== take a look ===#
parameters
    mtry min_node_size sample_fraction   oob_mse
 1:    2             2            0.50 0.3634654
 2:    2             2            0.75 0.3632539
 3:    2             2            1.00 0.3703735
 4:    2             5            0.50 0.3609772
 5:    2             5            0.75 0.3640391
 6:    2             5            1.00 0.3657121
 7:    2            10            0.50 0.3632176
 8:    2            10            0.75 0.3590104
 9:    2            10            1.00 0.3647577
10:    4             2            0.50 0.3639452
11:    4             2            0.75 0.3678789
12:    4             2            1.00 0.3762502
13:    4             5            0.50 0.3625460
14:    4             5            0.75 0.3682806
15:    4             5            1.00 0.3696961
16:    4            10            0.50 0.3632466
17:    4            10            0.75 0.3641201
18:    4            10            1.00 0.3693227
19:    7             2            0.50 0.3697563
20:    7             2            0.75 0.3773171
21:    7             2            1.00 0.3807321
22:    7             5            0.50 0.3749875
23:    7             5            0.75 0.3731546
24:    7             5            1.00 0.3857696
25:    7            10            0.50 0.3688267
26:    7            10            0.75 0.3758684
27:    7            10            1.00 0.3764331
    mtry min_node_size sample_fraction   oob_mse

So, the best choice among the ones tried is:

Code
parameters[which.min(oob_mse), ]
   mtry min_node_size sample_fraction   oob_mse
1:    2            10            0.75 0.3590104

Boosted Regression Forest

Gradient Boosting

In training an RF, which uses the idea of bagging, the original data is used to generate many bootstrapped datasets, a regression tree is trained on each of them independently, and the trees are then averaged to reach the final model. Boosting is similar to bagging (bootstrap aggregation) in that it trains many statistical models and then combines them to reach the final model. However, instead of building many trees independently, it builds trees sequentially in a manner that improves prediction step by step.

While there are many variants of boosting methods (see Chapter 10 of @hastie2009elements), we will look at gradient boosting using trees for regression in particular (Algorithm 10.3 in @hastie2009elements presents the generic gradient tree boosting algorithm), where squared error is used as the loss function.

  1. Set \(f_0(X_i) = \frac{\sum_{i=1}^N y_i}{N}\) for all \(i = 1, \dots, N\)
  2. For \(b = 1\) to \(B\):
     a. For \(i = 1, \dots, N\), calculate \[ r_{i,b} = (y_i - f_{b-1}(X_i)) \]
     b. Fit a regression tree to \(r_{i, b}\), which generates terminal regions \(R_{j,b}\), \(j = 1, \dots, J\), and denote the predicted value of region \(R_{j,b}\) as \(\gamma_{j,b}\).
     c. Set \(f_b(X_i) = f_{b-1}(X_i) + \lambda \cdot \sum_{j=1}^J\gamma_{j, b}\cdot I(X_i \in R_{j,b})\)
  3. Finally, set \(\hat{f}(X_i) = f_B(X_i)\)

Let’s go through this algorithm step by step to let it sink in.

Step 1

Step 1 finds the mean of the dependent variable. This quantity is used as the starting estimate for the dependent variable.

Code
(
f_0 <- mean(mlb1_dt$lsalary)
)
[1] 13.51172

Step 2: \(b = 1\)

Now, we get residuals:

Code
mlb1_dt[, resid_1 := lsalary - f_0]

The residuals contain information in lsalary that was left unexplained. By training a regression tree using the residuals as the dependent variable, we are finding a tree that can explain the unexplained parts of lsalary using the explanatory variables.

Code
tree_fit_b1 <- 
  rpart(
    resid_1 ~ ., # . means all variables
    data = mlb1_dt 
  )

Here is the fitted value of the residuals (\(\sum_{j=1}^J\gamma_{j, b}\cdot I(X_i \in R_{j,b})\))

Code
resid_1_hat <- predict(tree_fit_b1, newdata = mlb1_dt)
head(resid_1_hat)
         1          2          3          4          5          6 
 1.7134881  1.7134881  1.2414996  1.2414996  0.5054178 -0.1851016 

Now, we update our prediction according to \(f_b(X_i) = f_{b-1}(X_i) + \lambda \cdot \sum_{j=1}^J\gamma_{j, b}\cdot I(X_i \in R_{j,b})\). We set \(\lambda\) to be \(0.2\) in this illustration.

Code
lambda <- 0.2
f_1 <- f_0 + lambda * resid_1_hat
head(f_1)
       1        2        3        4        5        6 
13.85441 13.85441 13.76002 13.76002 13.61280 13.47470 

Did we actually improve prediction accuracy? Let’s compare f_0 and f_1.

Code
sum((mlb1_dt$lsalary - f_0)^2)
[1] 445.0615
Code
sum((mlb1_dt$lsalary - f_1)^2)
[1] 288.3205

Great. Let’s move on to \(b = 2\).

Code
#=== get the residuals ===#
mlb1_dt[, resid_2 := lsalary - f_1]

#=== fit a regression tree ===#
tree_fit_b2 <- 
  rpart(
    resid_2 ~ ., # . means all variables
    data = mlb1_dt 
  )

#=== get predicted values ===#
resid_2_hat <- predict(tree_fit_b2, newdata = mlb1_dt)

#=== update ===#
f_2 <- f_1 + lambda * resid_2_hat
Code
sum((mlb1_dt$lsalary - f_1)^2)
[1] 288.3205
Code
sum((mlb1_dt$lsalary - f_2)^2)
[1] 186.9229

We further improved our predictions. We repeat this process until a user-specified stopping criterion is met.
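The two hand-computed iterations above generalize to a simple loop. Here is a minimal sketch of the whole algorithm with \(\lambda = 0.2\) and \(B = 20\) (both values are illustrative); unlike the chunks above, it spells out the explanatory variables rather than using `.`:

```r
library(rpart)
library(data.table)
library(wooldridge)

data(mlb1)
boost_dt <- na.omit(data.table(mlb1)[, salary := NULL])

lambda <- 0.2 # learning rate
B <- 20       # number of trees

#=== Step 1: start from the sample mean ===#
f_b <- rep(mean(boost_dt$lsalary), nrow(boost_dt))
rss_path <- numeric(B)

for (b in 1:B) {
  #=== Step 2.a: residuals given the current prediction ===#
  boost_dt[, resid := lsalary - f_b]

  #=== Step 2.b: fit a tree to the residuals ===#
  tree_b <-
    rpart(
      resid ~ hruns + years + rbisyr + allstar + runsyr + hits + bavg,
      data = boost_dt
    )

  #=== Step 2.c: update the prediction ===#
  f_b <- f_b + lambda * predict(tree_b, newdata = boost_dt)

  rss_path[b] <- sum((boost_dt$lsalary - f_b)^2)
}

#=== training RSS should fall (weakly) as b grows ===#
rss_path
```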

As you probably have noticed, there are several key parameters in the process above that control the performance of gradient boosting. \(\lambda\) controls the speed of learning: the lower \(\lambda\) is, the slower the learning. \(B\) (the number of trees) determines how many times we make small improvements to the initial prediction. When you increase the value of \(\lambda\), you should decrease the value of \(B\). Too-high values of \(\lambda\) and \(B\) can lead to over-fitting.

You may have been wondering why this algorithm is called gradient boosting. Gradient boosting is much more general than the algorithm described here, which is gradient tree boosting for regression in particular. It can be applied to both regression and classification. In general, Step 2.a can be written as follows:

\[ r_{i,b} = - \left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f = f_{b-1}} \]

where \(L(y_i, f(x_i))\) is the loss function. For regression, the loss function is almost always squared error: \((y_i - f(x_i))^2\). For \(L(y_i, f(x_i)) = (y_i - f(x_i))^2\), the negative of the derivative of the loss function with respect to \(f(x_i)\) is

\[ - \left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f = f_{b-1}} = - (- 2 (y_i - f_{b-1}(x_i))) = 2 (y_i - f_{b-1}(x_i)) \]

This is why we have \(r_{i,b} = (y_i - f_{b-1}(X_i))\) at Step 2.a. And, as you just saw, we are using the gradient of the loss function for model updating, which is why it is called gradient boosting. Note that it does not really matter whether you keep the \(2\) in front of the residuals or not, because the fitted residuals are multiplied (scaled) by \(\lambda\) when updating the model: you can always find a value of \(\lambda\) that produces the same updates as when non-scaled residuals are used.

Most R and Python packages allow you to use a randomly selected fraction of the train sample and/or a subset of the included variables in building each tree within Step 2. This generates randomness in the algorithm, and such variants are referred to as stochastic gradient boosting.

Implementation in R

We can use the gbm package to train a gradient boosting regression. Just like ranger(), gbm() takes a formula and data, as below.

Code
library(gbm)
Loaded gbm 2.1.8
Code
#=== fit a gbm model ===#
gbm_fit <- 
  gbm(
    lsalary ~ hruns + years + rbisyr + allstar + runsyr + hits + bavg, 
    data = mlb1_dt 
  )
Distribution not specified, assuming gaussian ...

Here is a list of parameters to be aware of:

  • n.trees: Number of trees (\(B\)). Default is \(100\).
  • interaction.depth: 1 implies an additive model without interactions between included variables2, 2 implies a model with 2-way interactions. Default is 1.
  • n.minobsinnode: Minimum number of observations in the terminal nodes.
  • shrinkage: Learning rate (\(\lambda\)). Default is 0.1.
  • bag.fraction: The fraction of the training observations that are randomly selected for building each tree. Default is 0.5.
  • cv.folds: The number of folds used in k-fold cross-validation (KCV).

By specifying cv.folds, gbm() automatically conducts cross-validation for you.

Code
#=== gbm fit with CV ===#
gbm_fit <- 
  gbm(
    lsalary ~ hruns + years + rbisyr + allstar + runsyr + hits + bavg,
    data = mlb1_dt,
    cv.folds = 5
  )
Distribution not specified, assuming gaussian ...
Code
#=== see the MSE history ===#  
gbm_fit$cv.error
  [1] 1.2262241 1.1179341 1.0228205 0.9368023 0.8703427 0.8081771 0.7587331
  [8] 0.7184427 0.6784681 0.6488203 0.6220356 0.5944452 0.5684697 0.5462641
 [15] 0.5307501 0.5141117 0.4979657 0.4827620 0.4730010 0.4581580 0.4504190
 [22] 0.4443921 0.4371236 0.4309866 0.4229345 0.4173295 0.4134676 0.4087607
 [29] 0.4060998 0.4026218 0.3994747 0.3964138 0.3928210 0.3916108 0.3900712
 [36] 0.3871619 0.3863775 0.3846834 0.3842239 0.3805245 0.3795898 0.3791349
 [43] 0.3755839 0.3744459 0.3740001 0.3739954 0.3732254 0.3739662 0.3735692
 [50] 0.3727525 0.3696160 0.3695747 0.3697210 0.3678560 0.3674522 0.3653063
 [57] 0.3658336 0.3640528 0.3627853 0.3626865 0.3632326 0.3637658 0.3628768
 [64] 0.3606575 0.3606048 0.3615596 0.3613531 0.3611161 0.3618721 0.3626175
 [71] 0.3622998 0.3624336 0.3617014 0.3624890 0.3621559 0.3625732 0.3626680
 [78] 0.3625827 0.3620569 0.3613892 0.3611959 0.3609769 0.3610368 0.3599778
 [85] 0.3599650 0.3598605 0.3612187 0.3607883 0.3617094 0.3611722 0.3600410
 [92] 0.3604629 0.3595301 0.3591031 0.3587180 0.3575208 0.3580529 0.3586332
 [99] 0.3584507 0.3585487

You can visualize the CV results using gbm.perf().

Code
gbm.perf(gbm_fit)

[1] 96

Note that gbm.perf() tells you the optimal number of trees given the values of the other hyper-parameters (here, the default values). If you want to tune the other parameters as well, you need to program the tuning yourself.
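As a sketch of what such programming could look like, the snippet below grid-searches shrinkage and interaction.depth by refitting gbm() with cv.folds and comparing the minimum CV errors; the grid values are illustrative, not a recommendation.

```r
library(gbm)
library(wooldridge)

data(mlb1)
df <- na.omit(mlb1[, c("lsalary", "hruns", "years", "rbisyr", "allstar", "runsyr", "hits", "bavg")])

#=== illustrative hyper-parameter grid ===#
param_grid <-
  expand.grid(
    shrinkage = c(0.05, 0.1),
    interaction.depth = c(1, 2)
  )

set.seed(123)
cv_results <-
  lapply(
    seq_len(nrow(param_grid)),
    function(i) {
      fit <-
        gbm(
          lsalary ~ hruns + years + rbisyr + allstar + runsyr + hits + bavg,
          data = df,
          distribution = "gaussian",
          n.trees = 100,
          shrinkage = param_grid$shrinkage[i],
          interaction.depth = param_grid$interaction.depth[i],
          cv.folds = 5
        )
      #=== record the best CV error for this parameter combination ===#
      cbind(param_grid[i, ], cv_error = min(fit$cv.error))
    }
  )

#=== pick the row with the lowest cv_error ===#
do.call(rbind, cv_results)
```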

Extreme Gradient Boosting

Extreme gradient boosting (XGB) is a variant of gradient boosting that has been extremely popular due to its superb performance. The basic concept is the same as the gradient boosting algorithm described above; however, it has its own way of building a tree that is more mindful of avoiding over-fitting.

Tree updating in XGB (general)

Let \(f_{i,b}(x_i)\) be the prediction for the \(i\)th observation at the \(b\)-th iteration. Further, let \(w_t(x_i)\) be the term that is added to \(f_{i,b}(x_i)\) to obtain \(f_{i,b+1}(x_i)\). In XGB, \(w_t(x_i)\) is chosen so that it minimizes the following objective:

\[ \Psi_t = \sum_{i=1}^N [L(y_i, f_{i,b}(x_i) + w_t(x_i))] + \Omega(w_t) \tag{1}\]

where \(L()\) is the user-specified loss-function that is differentiable and \(\Omega(w_t)\) is the regularization term. Instead of Equation 1, XGB uses the second order Taylor expansion of \(L()\) about \(w\)3.

\[ \tilde{\Psi}_t = \sum_{i=1}^N [L(y_i, f_{i,b}(x_i)) + g_i w_t(x_i) + \frac{1}{2}h_i w_t(x_i)^2] + \Omega(w_t) \tag{2}\]

where \(g_i = \frac{\partial L(y_i, p_i)}{\partial p_i}\) (first-order derivative) and \(h_i = \frac{\partial^2 L(y_i, p_i)}{\partial p_i^2}\) (second-order derivative). Since \(L(y_i, f_{i,b}(x_i))\) is just a constant, we can safely remove it from the objective function, which leads to

\[ \tilde{\Psi}_t = \sum_{i=1}^N [g_i w_t(x_i) + \frac{1}{2}h_i w_t(x_i)^2] + \Omega(w_t) \tag{3}\]

Let \(I_j\) denote a set of observations that belong to leaf \(j\) (\(j = 1, \dots, J\)). Then, Equation 3 is written as follows:

\[ \tilde{\Psi}_t = \sum_{j=1}^J\huge[\normalsize (\sum_{i\in I_j}g_i)w_j + \frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)w_j^2 \huge]\normalsize + \gamma J \tag{4}\]

Remember that all the observations in the same leaf share the same prediction. So, for all the \(i\)s that belong to leaf \(j\), the prediction is denoted as \(w_j\) in Equation 4. That is, \(w_t(x_i)\) for an observation that belongs to leaf \(j\) is \(w_j\).

For a given tree structure (denoted as \(q(x)\)), the leaves can be treated independently in minimizing this objective.

Taking the derivative of \(\tilde{\Psi}_t\) w.r.t \(w_j\),

\[ \begin{aligned} (\sum_{i\in I_j}g_i) + (\sum_{i\in I_j}h_i + \lambda)w_j = 0 \\ \Rightarrow w_j^* = \frac{-\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i + \lambda} \end{aligned} \tag{5}\]

The minimized value of \(\tilde{\Psi}_t\) is then (obtained by plugging \(w_j^*\) into Equation 4),

\[ \begin{aligned} \tilde{\Psi}_t(q)^* & = \sum_{j=1}^J\huge[\normalsize (\sum_{i\in I_j}g_i)\frac{-\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i + \lambda} + \frac{1}{2}(\sum_{i\in I_j}h_i + \lambda)(\frac{-\sum_{i\in I_j}g_i}{\sum_{i\in I_j}h_i + \lambda})^2 \huge]\normalsize + \gamma J \\ & = \sum_{j=1}^J\huge[\normalsize \frac{-(\sum_{i\in I_j}g_i)^2}{\sum_{i\in I_j}h_i + \lambda} + \frac{1}{2}\frac{(\sum_{i\in I_j}g_i)^2}{\sum_{i\in I_j}h_i + \lambda} \huge]\normalsize + \gamma J \\ & = -\frac{1}{2} \sum_{j=1}^J \huge[\normalsize\frac{(\sum_{i\in I_j}g_i)^2}{\sum_{i\in I_j}h_i + \lambda}\huge]\normalsize + \gamma J \end{aligned} \tag{6}\]

For notational convenience, we call \(\frac{(\sum_{i\in I_j}g_i)^2}{\sum_{i\in I_j}h_i + \lambda}\) the quality score and denote it by \(Q_j\) (the quality score for leaf \(j\)).
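As a quick numerical sanity check of the closed form in Equation 5 (not part of the original derivation), we can minimize the single-leaf objective directly and compare; the \(g_i\), \(h_i\), and \(\lambda\) values below are made up.

```r
#=== made-up gradients for a three-observation leaf ===#
g <- c(-3, -1, 2) # first-order derivatives g_i
h <- c(1, 1, 1)   # second-order derivatives h_i (all 1 under squared error)
lambda <- 1

#=== single-leaf objective: sum of g_i * w + (1/2) h_i * w^2, plus the
#=== L2 regularization term (1/2) lambda * w^2 ===#
obj <- function(w) sum(g * w + 0.5 * h * w^2) + 0.5 * lambda * w^2

w_star <- -sum(g) / (sum(h) + lambda)                          # Equation 5
w_num <- optimize(obj, interval = c(-10, 10), tol = 1e-8)$minimum

c(w_star, w_num) # the two agree up to numerical tolerance
```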

We could find the best tree structure by finding \(w_j^*(q)\) according to Equation 5 and calculating \(\tilde{\Psi}_t(q)^*\) according to Equation 6 for each of all the possible tree structures, and then picking the tree structure \(q(x)\) that has the lowest \(\tilde{\Psi}_t(q)^*\).

However, it is practically impossible to consider all possible tree structures. So, a greedy (myopic) approach that starts from a single leaf and iteratively splits leaves is used instead.

Consider splitting an existing leaf \(s\) (wherever in the tree it may be located) into two leaves \(L\) and \(R\) when there are \(J\) existing leaves. Then, we find \(w_j^*\) and calculate \(\tilde{\Psi}_t(q)^*\) for each leaf, and the resulting minimized objective is

\[ -\frac{1}{2} \huge[\normalsize Q_L + Q_R + \Gamma \huge]\normalsize + \gamma(J+1) \]

where \(\Gamma\) is the sum of quality scores for all the leaves except \(L\) and \(R\).

\[ \Gamma = \sum_{j\ne \{L, R\}}^J Q_j \]

The minimized objective before splitting is

\[ -\frac{1}{2} \huge[\normalsize Q_s + \Gamma \huge]\normalsize + \gamma J \]

So, the reduction in loss after the split is

\[ G(s, L, R) = \frac{1}{2} \huge[\normalsize Q_L + Q_R - Q_s \huge]\normalsize - \gamma \]

Let’s call \(G(s, L, R)\) simply a gain (of the split).

A more positive value of gain (\(G(s, L, R)\)) means a more successful split.

We can try many different patterns of \(I_L\) and \(I_R\) (how to split tree \(s\)), calculate the gain for each of them and pick the split that has the highest gain.

Different patterns of \(I_L\) and \(I_R\) arise from different variable-cutpoint combinations

If the highest gain is negative, then the leaf under consideration for splitting is not split.

Once the best tree is chosen (the tree that has the highest gain among the ones investigated), then we update our prediction based on \(w^*\) of the tree. For observation \(i\) that belongs to leaf \(j\) of the tree,

\[ \begin{aligned} f_{i,b} = f_{i,b-1} + \eta \cdot w_j^* \end{aligned} \tag{7}\]

where \(\eta\) is the learning rate.

Tree updating in XGB (regression)

We now make the general tree updating algorithm specific to regression problems, where the loss function is squared error: \(L(y_i, p_i) = \frac{1}{2}(y_i - p_i)^2\), where \(p_i\) is the predicted value for \(i\).

First, let’s find \(g_i\) and \(h_i\) for \(L(y_i, p_i) = \frac{1}{2}(y_i - p_i)^2\).

\[ \begin{aligned} g_i = \frac{\partial L(y_i, p_i)}{\partial p_i} = -(y_i - p_i)\\ h_i = \frac{\partial^2 L(y_i, p_i)}{\partial p_i^2} = 1 \\ \end{aligned} \]

So, \(g_i\) is simply the negative of the residual for \(i\).

Now, suppose you are at iteration \(b\) and the predicted value for \(i\) is denoted as \(f_{i,b}(x_i)\). Further, let \(r_{i,b}\) denote the residual (\(y_i - f_{i,b}(x_i)\)).

Plugging these into Equation 5,

\[ \begin{aligned} w_j^* & = \frac{\sum_{i\in I_j}r_{i,b}}{\sum_{i\in I_j}1 + \lambda} \\ & = \frac{\sum_{i\in I_j}r_{i,b}}{N_j + \lambda} \end{aligned} \tag{8}\]

That is, for a given leaf \(j\), the optimal predicted value (\(w_j^*\)) is the sum of the residuals of all the observations in leaf \(j\) divided by the number of observations in leaf \(j\) plus \(\lambda\). When \(\lambda = 0\), the optimal predicted value (\(w_j^*\)) is simply the mean of the residuals.

The quality score for leaf \(j\) is then,

\[ Q_j = \frac{(\sum_{i\in I_j}r_{i,b})^2}{N_j + \lambda} \tag{9}\]

Illustration of XGB for regression

In order to further our understanding of the entire XGB algorithm, let's take a look at a simple regression problem as an illustration. We consider a four-observation dataset as follows:

Code
(
data <-
  data.table(
    y = c(-3, 7, 8, 12),
    x = c(1, 4, 6, 8)
  )
)
    y x
1: -3 1
2:  7 4
3:  8 6
4: 12 8
Code
(
g_0 <-
  ggplot(data) +
  geom_point(aes(y = y, x = x))
)

First step (\(b = 0\)) is to make an initial prediction. This can be any number, but let’s use the mean of y and set it as the predicted value for all the observations.

Code
(
f_0 <- mean(data$y) # f_0: the predicted value for all the observations
)
[1] 6

Let’s set \(\gamma\), \(\lambda\), and \(\eta\) to \(10\), \(1\), and \(0.3\), respectively.

Code
gamma <- 10
lambda <- 1
eta <- 0.3

We have a single-leaf tree at the moment. And the quality score for this leaf is

quality score for leaf \(j\) is \(\frac{(\sum_{i\in I_j}r_{i,b})^2}{N_j + \lambda}\)

Code
#=== get residuals ===#
data[, resid := y - f_0]

#=== get quality score ===#
(
q_0 <- (sum(data$resid))^2/(nrow(data) + 1)
)
[1] 0

Quality score of the leaf is 0.

Since we are using the mean of \(y\) as the prediction, of course, the sum of the residuals is zero, which then means that the quality score is zero.

Now, we have three potential split patterns: {x, 2}, {x, 5}, {x, 7}.

{x, 2} means the leaf is split into two leaves: \(\{x | x < 2\}\) and \(\{x | x \geq 2\}\). Note that any cutpoint between \(1\) and \(4\) would result in the same split.

Let’s consider them one by one.

Split: {x, 2}

Here are graphical representations of the split:

Code
g_0 +
  geom_vline(xintercept = 2, color = "red") +
  annotate("text", x = 1.25, y = 6, label = "leaf L", color = "red") +
  annotate("text", x = 5, y = 6, label = "leaf R", color = "red")

Code
DiagrammeR::grViz(
"
digraph {
  graph [ranksep = 0.2]
  node [shape = box]
    T1R [label = 'L: -9']
    T1L [label = 'R: 1 , 2 , 6']
    T0 [label = '-9, 1 , 2 , 6']
  edge [minlen = 2]
    T0->T1L
    T0->T1R
  { rank = same; T1R; T1L}
}
"
)

Let’s split the data.

Code
#=== leaf L ===#
(
data_L_1 <- data[x < 2, ]
)
    y x resid
1: -3 1    -9
Code
#=== leaf R ===#
(
data_R_1 <- data[x >= 2, ]
)
    y x resid
1:  7 4     1
2:  8 6     2
3: 12 8     6

Using Equation 8,

\(w_j^* = \frac{\sum_{i\in I_j}r_{i,b}}{N_j + \lambda}\)

Code
w_L <- (sum(data_L_1$resid))/(nrow(data_L_1) + lambda)
w_R <- (sum(data_R_1$resid))/(nrow(data_R_1) + lambda)

\[ \begin{aligned} w_L^* & = -9 / (1 + 1) = -4.5 \\ w_R^* & = (1 + 2 + 6) / (3 + 1) = 2.25 \end{aligned} \]

Using Equation 9, the quality scores for the leaves are

\(Q_j = \frac{(\sum_{i\in I_j}r_{i,b})^2}{N_j + \lambda}\)

Code
q_L <- (sum(data_L_1$resid))^2/(nrow(data_L_1) + lambda)
q_R <- (sum(data_R_1$resid))^2/(nrow(data_R_1) + lambda)
Code
DiagrammeR::grViz(
  paste0(
  "
  digraph {
    graph [ranksep = 0.2]
    node [shape = box]
      T1R [label = 'L: -9 \n Q score = ", round(q_L, digits = 2), "']
      T1L [label = 'R: 1 , 2 , 6 \n Q score = ", round(q_R, digits = 2), "']
      T0 [label = '-9, 1 , 2 , 6']
    edge [minlen = 2]
      T0->T1L
      T0->T1R
    { rank = same; T1R; T1L}
  }
  "
  )

)

\[ \begin{aligned} q_L^* & = (-9)^2 / (1 + 1) = 40.5 \\ q_R^* & = (1 + 2 + 6)^2 / (3 + 1) = 20.25 \end{aligned} \]

Notice that the residuals are first summed and then squared in the numerator of the quality score (the higher, the better). This means that if the predictions are off in the same direction (meaning the observations are similar) within the leaf, the quality score is higher. On the other hand, if the predictions are off in both directions (meaning the observations are not similar), the residuals cancel each other out, resulting in a lower quality score. Since we would like to create leaves consisting of similar observations, a more successful split has a higher quality score.

Finally, the gain of this split is

\[ G(s, L, R) = \frac{1}{2} \huge[\normalsize Q_L + Q_R - Q_s \huge]\normalsize - \gamma \]

where \(s\) is the leaf before split, \(L\) and \(R\) are leaves after the split of leaf \(s\).

Code
gain_1 <- (q_L + q_R - q_0)/2 - gamma

\[ G_1 = \frac{40.5 + 20.25 - 0}{2} - 10 = 20.375 \]

Now that we have gone through the process of finding the update value (\(w\)), quality score (\(Q\)), and gain (\(G\)) for a given split, let's write a function that returns these measures for a given cutpoint before moving on to the next split candidate.

Code
get_info <- function(data, cutpoint, lambda, gamma)
{
  q_0 <- (sum(data$resid))^2/(nrow(data) + lambda)

  data_L <- data[x < cutpoint, ]
  data_R <- data[x >= cutpoint, ]

  w_L <- (sum(data_L$resid))/(nrow(data_L) + lambda)
  w_R <- (sum(data_R$resid))/(nrow(data_R) + lambda)

  q_L <- (sum(data_L$resid))^2/(nrow(data_L) + lambda)
  q_R <- (sum(data_R$resid))^2/(nrow(data_R) + lambda)

  gain <- (q_L + q_R - q_0)/2 - gamma

  return(list(
    w_L = w_L, 
    w_R = w_R, 
    q_L = q_L, 
    q_R = q_R, 
    gain = gain 
  ))
}

Split: {x, 5}

Code
measures_2 <- get_info(data, 5, lambda, gamma)
Code
g_0 +
  geom_vline(xintercept = 5, color = "red") +
  annotate("text", x = 3, y = 6, label = "leaf L", color = "red") +
  annotate("text", x = 7, y = 6, label = "leaf R", color = "red")

Code
DiagrammeR::grViz(
  paste0(
    "
    digraph {
      graph [ranksep = 0.2]
      node [shape = box]
        T1R [label = 'L: -9, 1 \n Q score = ", round(measures_2$q_L, digits = 2), "']
        T1L [label = 'R: 2 , 6 \n Q score = ", round(measures_2$q_R, digits = 2), "']
        T0 [label = '-9, 1 , 2 , 6']
      edge [minlen = 2]
        T0->T1L
        T0->T1R
      { rank = same; T1R; T1L}
    }
    "
  )
)

\[ \begin{aligned} q_L^* & = (-9 + 1)^2 / (2 + 1) = 21.33 \\ q_R^* & = (2 + 6)^2 / (2 + 1) = 21.33 \end{aligned} \]

\[ G_2 = \frac{21.33 + 21.33 - 0}{2} - 10 = 11.33 \]

Split: {x, 7}

Code
measures_3 <- get_info(data, 7, lambda, gamma)
Code
g_0 +
  geom_vline(xintercept = 7, color = "red") +
  annotate("text", x = 4, y = 6, label = "leaf L", color = "red") +
  annotate("text", x = 8, y = 6, label = "leaf R", color = "red")

Code
DiagrammeR::grViz(
  paste0(
    "
    digraph {
      graph [ranksep = 0.2]
      node [shape = box]
        T1R [label = 'L: -9, 1, 2 \n Q score = ", round(measures_3$q_L, digits = 2), "']
        T1L [label = 'R: 6 \n Q score = ", round(measures_3$q_R, digits = 2), "']
        T0 [label = '-9, 1 , 2 , 6']
      edge [minlen = 2]
        T0->T1L
        T0->T1R
      { rank = same; T1R; T1L}
    }
    "
  )
)

\[ \begin{aligned} q_L^* & = (-9 + 1 + 2)^2 / (3 + 1) = 9 \\ q_R^* & = 6^2 / (1 + 1) = 18 \end{aligned} \]

\[ G_3 = \frac{9 + 18 - 0}{2} - 10 = 3.5 \]

Among all the splits we considered, the first one (Split: {x, 2}) has the highest gain. This is easy to confirm visually and shows that picking a split based on the gain measure indeed makes sense.

Now we consider how to split leaf R (leaf L cannot be split further as it has only one observation). We have two split candidates: {x, 5} and {x, 7}. Let’s get the gain measures using get_info().

Code
#=== first split ===#
get_info(data_R_1, 5, lambda, gamma)$gain 
[1] -9.208333
Code
#=== second split ===#
get_info(data_R_1, 7, lambda, gamma)$gain
[1] -9.625

So, neither of the splits has a positive gain value. Therefore, we do not adopt either of the splits. For this iteration (\(b=1\)), this is the end of tree building.

Note

If the value of \(\gamma\) were lower (say, 0), then we would have adopted the first split.

Code
get_info(data_R_1, 5, lambda, 0)$gain # first split
[1] 0.7916667
Code
get_info(data_R_1, 7, lambda, 0)$gain # second split
[1] 0.375

As you can see, a higher value of \(\gamma\) leads to more aggressive tree pruning.

So, the final tree for this iteration (\(b = 1\)) is

Code
DiagrammeR::grViz(
  paste0(
  "
  digraph {
    graph [ranksep = 0.2]
    node [shape = box, width = 0.3, height = 0.15, fontsize = 3, fixedsize = TRUE, penwidth = 0.2]
      T1R [label = 'L: -9 \n w* = ", round(w_L, digits = 2), "']
      T1L [label = 'R: 1 , 2 , 6 \n w* = ", round(w_R, digits = 2), "']
      T0 [label = '-9, 1 , 2 , 6']
    edge [penwidth = 0.2, arrowsize = 0.3, len = 0.3]
      T0->T1L
      T0->T1R
    { rank = same; T1R; T1L}
  }
  "
  )

)

We now use \(w^*\) from this tree to update our prediction according to Equation 7.

\(f_{i,b} = f_{i,b-1} + \eta \cdot w_j^*\)

Code
measures_1 <- get_info(data, 2, lambda, gamma)

Since the first observation is in \(L\),

\[ f_{i = 1,b = 1} = 6 + 0.3 \times -4.5 = 4.65 \]

Since the second, third, and fourth observations are in \(R\),

\[ \begin{aligned} f_{i = 2,b = 1} = 6 + 0.3 \times 2.25 = 6.68 \\ f_{i = 3,b = 1} = 6 + 0.3 \times 2.25 = 6.68\\ f_{i = 4,b = 1} = 6 + 0.3 \times 2.25 = 6.68 \end{aligned} \]

Code
data %>% 
  .[, f_0 := f_0] %>% 
  .[1, f_1 := f_0 + measures_1$w_L * eta] %>%
  .[2:4, f_1 := f_0 + measures_1$w_R * eta]

The prediction updates can be seen below. Though small, they are improvements in our prediction.

Code
ggplot(data = data) +
  geom_point(aes(y = y, x = x, color = "observed")) +
  geom_point(aes(y = f_1, x = x, color = "after (f1)")) +
  geom_point(aes(y = f_0, x = x, color = "before (f0)")) +
  scale_color_manual(
    values = 
      c(
        "before (f0)" = "blue", 
        "after (f1)" = "red",
        "observed" = "black"
      ), 
    name = ""
  ) +
  geom_segment(
    aes(y = f_0, x = x, yend = f_1, xend = x), 
    color = "blue",
    arrow = arrow(length = unit(0.1, "cm"))
  ) +
  theme_bw()

Now, we move on to \(b=2\). We first update residuals:

Code
data[, resid := y - f_1]

data
    y x  resid f_0   f_1
1: -3 1 -7.650   6 4.650
2:  7 4  0.325   6 6.675
3:  8 6  1.325   6 6.675
4: 12 8  5.325   6 6.675

Just like at \(b=1\), all the possible splits are {x, 2}, {x, 5}, {x, 7}. Let’s find the gain for each split.

Code
lapply(
  c(2, 5, 7),
  function(x) get_info(data, x, lambda, gamma)$gain
)
[[1]]
[1] 10.66639

[[2]]
[1] 6.267458

[[3]]
[1] 1.543344

So, the first split is again the best one. Should we split the right leaf, which contains all the observations except the first one?

Code
lapply(
  c(5, 7),
  function(x) get_info(data[2:4, ], x, lambda, gamma)$gain
)
[[1]]
[1] -8.684505

[[2]]
[1] -8.538672

Both splits have negative gains. So, we do not split this leaf, just like at \(b=1\).

So, the final tree for this iteration (\(b = 2\)) is

Code
measures_b2 <- get_info(data, 2, lambda, gamma)


DiagrammeR::grViz(
  paste0(
  "
  digraph {
    graph [ranksep = 0.2]
    node [shape = box, width = 0.4, height = 0.15, fontsize = 3, fixedsize = TRUE, penwidth = 0.2]
      T1R [label = 'L: -7.65 \n w* = ", round(measures_b2$w_L, digits = 2), "']
      T1L [label = 'R: 0.33 , 1.33 , 5.33 \n w* = ", round(measures_b2$w_R, digits = 2), "']
      T0 [label = '-7.65, 0.33 , 1.33 , 5.33']
    edge [penwidth = 0.2, arrowsize = 0.3, len = 0.3]
      T0->T1L
      T0->T1R
    { rank = same; T1R; T1L}
  }
  "
  )

)

Let’s now update our predictions.

Code
data %>% 
  .[1, f_2 := f_1 + measures_b2$w_L * eta] %>%  
  .[2:4, f_2 := f_1 + measures_b2$w_R * eta] 
Code
ggplot(data = data) +
  geom_point(aes(y = y, x = x, color = "observed")) +
  geom_point(aes(y = f_2, x = x, color = "f2")) +
  geom_point(aes(y = f_1, x = x, color = "f1")) +
  geom_point(aes(y = f_0, x = x, color = "f0")) +
  scale_color_manual(
    values = 
      c(
        "f0" = "blue", 
        "f1" = "red",
        "f2" = "red",
        "observed" = "black"
      ), 
    name = ""
  ) +
  geom_segment(
    aes(y = f_0, x = x, yend = f_1, xend = x), 
    color = "blue",
    arrow = arrow(length = unit(0.1, "cm"))
  ) +
  geom_segment(
    aes(y = f_1, x = x, yend = f_2, xend = x), 
    color = "blue",
    arrow = arrow(length = unit(0.1, "cm"))
  ) +
  theme_bw()

Again, we made small improvements in our predictions. This process continues until the user-specified stopping criteria are met.
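The per-iteration steps we walked through by hand can be automated. Below is a sketch that repeats the split search and prediction updates on the toy data until no candidate split has a positive gain; for simplicity, each tree is restricted to a single split, as in the hand-worked iterations.

```r
library(data.table)

data <- data.table(y = c(-3, 7, 8, 12), x = c(1, 4, 6, 8))
lambda <- 1; gamma <- 10; eta <- 0.3
cutpoints <- c(2, 5, 7)

f <- rep(mean(data$y), nrow(data)) # initial prediction

for (b in 1:50) {
  data[, resid := y - f]
  q_0 <- sum(data$resid)^2 / (nrow(data) + lambda)

  #=== gain of each candidate split (quality scores from Equation 9) ===#
  gains <-
    sapply(
      cutpoints,
      function(cp) {
        r_L <- data[x < cp, resid]
        r_R <- data[x >= cp, resid]
        q_L <- sum(r_L)^2 / (length(r_L) + lambda)
        q_R <- sum(r_R)^2 / (length(r_R) + lambda)
        (q_L + q_R - q_0) / 2 - gamma
      }
    )

  if (max(gains) < 0) break # no split has a positive gain: stop

  #=== update the predictions with the best split (Equations 7 and 8) ===#
  cp <- cutpoints[which.max(gains)]
  w_L <- sum(data[x < cp, resid]) / (nrow(data[x < cp, ]) + lambda)
  w_R <- sum(data[x >= cp, resid]) / (nrow(data[x >= cp, ]) + lambda)
  f <- f + eta * ifelse(data$x < cp, w_L, w_R)
}

data[, f_final := f][]
```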

Tip
  • \(\lambda\):
    • A higher value of \(\lambda\) leads to a lower value of prediction updates (\(w^*\)).
    • A higher value of \(\lambda\) leads to a lower value of quality score (\(Q\)), thus leading to a lower value of gain (\(G\)), which then leads to more aggressive pruning for a given value of \(\gamma\).
  • \(\gamma\):
    • A higher value of \(\gamma\) leads to more aggressive pruning.
  • \(\eta\):
    • A higher value of \(\eta\) leads to faster learning.
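The effect of \(\lambda\) on the size of the prediction updates can be seen directly from Equation 8 with a few illustrative numbers:

```r
resids <- c(1, 2, 6) # residuals of the observations in a leaf (illustrative)

#=== w* = sum of residuals / (number of observations + lambda) ===#
sapply(c(0, 1, 10), function(lambda) sum(resids) / (length(resids) + lambda))
# a higher lambda shrinks the update w* toward zero: 3, 2.25, then about 0.69
```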

Implementation

You can use the xgboost package to implement XGB modeling.

Code
library(xgboost)

Attaching package: 'xgboost'
The following object is masked from 'package:rattle':

    xgboost
The following object is masked from 'package:dplyr':

    slice

The first task is to create an object of class xgb.DMatrix using the xgb.DMatrix() function. You provide the explanatory variable matrix to the data option and the dependent variable vector to the label option of xgb.DMatrix() like below.

Code
mlb1_dm_X <- 
  xgb.DMatrix(
    data = as.matrix(mlb1_dt[, .(hruns, years, rbisyr, allstar, runsyr, hits, bavg)]),
    label = as.matrix(mlb1_dt[, lsalary])
  )

We can then use xgb.train() to train a model using the XGB algorithm.

Code
xgb_fit <-
  xgb.train(
    data = mlb1_dm_X, # xgb.DMatrix holding the explanatory variables and the label
    nrounds = 100, # number of iterations (trees to add)
    eta = 1, # learning rate
    objective = "reg:squarederror" # objective function
  )
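The call to xgb.train() above uses fixed hyper-parameter values. You can also cross-validate the number of rounds with xgb.cv(); the parameter values below are illustrative, not tuned.

```r
library(xgboost)
library(wooldridge)

data(mlb1)
df <- na.omit(mlb1[, c("lsalary", "hruns", "years", "rbisyr", "allstar", "runsyr", "hits", "bavg")])

dtrain <-
  xgb.DMatrix(
    data = as.matrix(df[, -1]),
    label = df$lsalary
  )

set.seed(123)
xgb_cv <-
  xgb.cv(
    params = list(
      eta = 0.3,     # learning rate (eta in the text)
      max_depth = 3, # maximum depth of each tree
      lambda = 1,    # L2 regularization on leaf weights (lambda in the text)
      gamma = 0,     # minimum gain required to keep a split (gamma in the text)
      objective = "reg:squarederror"
    ),
    data = dtrain,
    nrounds = 200,
    nfold = 5,
    early_stopping_rounds = 10, # stop once the CV error stops improving
    verbose = FALSE
  )

xgb_cv$best_iteration # the optimal number of rounds
```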

Over-fitting

Resources

Footnotes

  1. Indeed, all the algorithms and models we have talked about can be applied to classification problems with some small changes.↩︎

  2. You can of course create interactions terms yourself in the data, which would allow simple linear 2-way interactions.↩︎

  3. This helps for some of the commonly used loss functions.↩︎